GAN photo editing
CMU 16-726 Image Synthesis S22
Tomas Cabezon Pedroso
In this project, we explore the possibilities of image synthesis and editing using the GAN latent space. First, we invert a pre-trained generator to find a latent variable that closely reconstructs a given real image. In the second part of the assignment, we interpolate between two images in the latent space, and we finish with image editing: we take a hand-drawn sketch and generate an image that fits the sketch, and then we use such sketches to edit a given image.
This project is based on the following two articles: Generative Visual Manipulation on the Natural Image Manifold and Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?.
Inverting the Generator
For the first part of the assignment, we solve an optimization problem to reconstruct the image from a particular latent code. Natural images lie on a low-dimensional manifold, and we consider the output manifold of a trained generator to be close to the natural image manifold. So, we can set up the following nonconvex optimization problem:
For some choice of loss $\mathcal{L}$, trained generator $G$, and a given real image $x$, we can write

$$ z^* = \arg\min_{z} \mathcal{L}(G(z), x) $$
We choose a combination of pixel and perceptual loss, as the standard Lp losses do not work well for image synthesis tasks. We also tried BCE loss, but it did not give good results. For the implementation of this part of the assignment, we reuse what we learned in assignment 4, Neural Style Transfer. As this is a nonconvex optimization problem where we can access gradients, we attempt to solve it with a first-order or quasi-Newton optimization method (in our case, LBFGS).
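As a rough illustration (not our exact code), the inversion loop can be sketched in PyTorch as follows; the generator G, the latent size, and loss_fn are placeholders:

    import torch

    def invert(G, target, loss_fn, num_steps=1000):
        # Optimize a random latent code so that G(z) reconstructs the target image.
        z = torch.randn(1, 128, requires_grad=True)   # latent size is a placeholder
        optimizer = torch.optim.LBFGS([z])

        def closure():
            optimizer.zero_grad()
            loss = loss_fn(G(z), target)
            loss.backward()
            return loss

        for _ in range(num_steps):
            optimizer.step(closure)
        return z.detach()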
Perceptual and Pixel Loss: The content loss is a metric that measures the content distance between two images at a given layer. Denote the $L$-th layer feature of the input image $X$ as $f^L_X$ and that of the target content image $C$ as $f^L_C$. The content loss is defined as the squared L2-distance of these two features:

$$ \mathcal{L}_{content}(X, C, L) = \| f^L_X - f^L_C \|_2^2 $$
To extract the features, a VGG-19 network pre-trained on ImageNet is used. The pre-trained VGG-19 consists of 5 blocks (conv1-conv5, with a total of 16 conv layers), and each block serves as a feature extractor at a different level of abstraction. As we saw in the previous assignment, the choice of layer has a big influence on the results. For this assignment, we use the conv_5 layer, as it gives the best results.
For pixel loss, we implement the L1 loss over the pixel space.
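A minimal sketch of the combined loss is shown below. Using features[:29] as a stand-in for the "conv_5" (conv5_1) activations is an assumption about torchvision's layer indexing, and the default weights mirror the values reported in the Results below:

    import torch
    import torch.nn.functional as F
    from torchvision import models

    # Frozen VGG-19 feature extractor (up to conv5_1, by assumption).
    vgg = models.vgg19(pretrained=True).features[:29].eval()
    for p in vgg.parameters():
        p.requires_grad_(False)

    def perceptual_loss(x, target):
        # Squared L2 distance between the conv_5 features of the two images.
        return F.mse_loss(vgg(x), vgg(target))

    def pixel_loss(x, target):
        # L1 distance in pixel space.
        return F.l1_loss(x, target)

    def total_loss(x, target, w_perc=0.01, w_pixel=10.0):
        return w_perc * perceptual_loss(x, target) + w_pixel * pixel_loss(x, target)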
Results
In this part of the assignment we try different combinations of VGG-19 layers, different perceptual and pixel loss weights, as well as different latent spaces (z, w and w+). We have seen that the best results are obtained using the conv_5 layer, a weight of 0.01 for the perceptual loss and 10.0 for the pixel loss. Nevertheless, we will use other weights for the following parts of the assignment. We also compare the outputs of a vanilla GAN and a StyleGAN and, as expected, the latter gives better results. We optimize the images for 1000 iterations, as more optimization time does not result in better output quality.
Interpolations
We use StyleGAN and the w+ space to embed two images into the latent space and output the images of their interpolation. In the following images and GIFs we can observe that the transitions are smooth and neat.
In the top right images we can see how the plants disappear in the embedded images; this is a good example of how StyleGANs keep the important overall features of the data they are trained on but do not learn smaller details.
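The interpolation itself is just a linear blend of the two embedded w+ codes. A hedged sketch is given below; G.synthesis is an assumption about the StyleGAN generator interface and may need adapting:

    import torch

    def interpolate(G, w_a, w_b, num_frames=40):
        # Linearly interpolate between two embedded w+ codes and synthesize each frame.
        frames = []
        for t in torch.linspace(0.0, 1.0, num_frames):
            w = (1.0 - t) * w_a + t * w_b
            frames.append(G.synthesis(w))   # G.synthesis is assumed, not our exact API
        return frames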
Scribble to Image
We can treat the scribble similarly to the input reconstruction. In this part, we have a scribble and a mask, so we can modify the latent vector to yield an image that matches the scribble. To generate an image subject to constraints, we solve a penalized nonconvex optimization problem.
We'll assume the constraints are of the form $f_i(x) = v_i$ for some scalar-valued functions $f_i$ and scalar values $v_i$. Written in a form that includes our trained generator $G$, this soft-constrained optimization problem is:

$$ z^* = \arg\min_{z} \sum_i \| f_i(G(z)) - v_i \|^2 $$
Given a user color scribble, we would like the GAN to fill in the details. Say we have a hand-drawn scribble image $S$ with a corresponding mask $M$. Then for each pixel in the mask, we can add a constraint that the corresponding pixel in the generated image must be equal to the sketch, which might look like $M_i \cdot G(z)_i = M_i \cdot S_i$ for every pixel $i$. Since our color scribble constraints are all elementwise, we can reduce the above equation under our constraints to

$$ z^* = \arg\min_{z} \| M \odot G(z) - M \odot S \|^2 $$

where $\odot$ is the Hadamard product, $M$ is the mask, and $S$ is the sketch. For the results below, we have used a perceptual loss weight of 0.05 and an L1 loss weight of 5.
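A minimal sketch of this masked objective, assuming the mask has already been broadcast to the image shape and reusing the perceptual_loss sketched earlier; the weights mirror the values reported above:

    import torch.nn.functional as F

    def scribble_loss(image, sketch, mask, perceptual_loss, w_perc=0.05, w_pixel=5.0):
        # Constrain the generated image only where the mask is non-zero (Hadamard product).
        masked_image = mask * image
        masked_sketch = mask * sketch
        pixel = F.l1_loss(masked_image, masked_sketch)
        perc = perceptual_loss(masked_image, masked_sketch)
        return w_pixel * pixel + w_perc * perc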
Image Editing
Similarly to the previous section, we will use the perceptual and pixel losses to edit an image. However, in this case we will embed the initial image in the latent space and then apply the sketch to edit it. In the following images some of the results of this image editing are shown. These images have been obtained using the conv_4 layer to calculate the loss. We can observe that some of the colors in the sketches are not present in the GAN latent space and therefore not in the output images, so we get similar colors, but not the same ones.
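A hypothetical end-to-end sketch of this editing step, assuming the scribble_loss and perceptual_loss helpers sketched earlier and a generator exposing G.synthesis: embed the original image first, then keep optimizing that same w+ code against the sketch constraint.

    import torch

    def edit_with_sketch(G, w_init, sketch, mask, perceptual_loss, num_steps=1000):
        # Start from the w+ code of the embedded original image (not random noise), so the
        # optimization stays close to the input and only pushes the masked region to the sketch.
        w = w_init.clone().detach().requires_grad_(True)
        optimizer = torch.optim.LBFGS([w])

        def closure():
            optimizer.zero_grad()
            image = G.synthesis(w)
            loss = scribble_loss(image, sketch, mask, perceptual_loss)
            loss.backward()
            return loss

        for _ in range(num_steps):
            optimizer.step(closure)
        return G.synthesis(w.detach())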
Bells & Whistles
High Resolution Grumpy Cats
We used a higher-resolution GAN to generate more detailed grumpy cats and their interpolations! Here are the results:
User interface and demo
In the following GIFs, two possible user interfaces and interactions can be seen. The user is able to draw the editing sketches, and the model optimizes the image output that best matches the initial image and the editing sketch.
What about Kim?
In the paper Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?, the authors embed different image classes in the StyleGAN latent space trained on the FFHQ dataset. They show that the model is capable of embedding images even if it was not trained on those image classes. In the paper we can see that, even though these images can be embedded in the latent space, the interpolation between them leads to images with features of the image class the model was trained on, in their case, faces. We decided to see what happens with our model when we try to embed images that are not cats. There could not be better images to embed than Kim Kardashian and the Spanish Ecce Homo. In the following images we can see that, unlike in the paper, our network is not capable of reconstructing the images. In the interpolations, as expected, we can also see the cat features that the network has learned.